Where is the data from?
The data is available on the website of OPHI. This is the link.
The Table 1 has national MPI results for the year 2019.
Libraries
## here() starts at D:/R projects/MPI OPHI/shiny_web_application_MPI
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## -- Attaching packages --------
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
To know why I use {here}, go to this link
Importing Data
The Table 1 has 7 sheets. names and numbered as following (as given in the excel file):
1.1 National MPI Results:
The table sorts 101 countries from low to high according to the Multidimensional Poverty Index (MPI) and presents the basic MPI results.
1.2 Censored Headcounts
The table shows the proportion of people who are MPI poor and experience deprivations in each of the indicators. The table sorts 101 countries from low to high according to the Multidimensional Poverty Index (MPI).
1.3 Contribution of Deprivation
The table shows the percentage contribution of each dimension and indicator to the MPI. The table sorts 101 countries from low to high according to the Multidimensional Poverty Index (MPI).
1.4 MPI Results and complete data
The table sorts 101 countries from low to high according to their Multidimensional Poverty Index (MPI) score and compares that with external data such as income poverty figures, GNI per capita, Gini Index, HDI rank and income category.
1.5 SEs and CIs
The table sorts countries from low to high according to the Multidimensional Poverty Index (MPI) and presents the standard errors and confidence intervals for the MPI and the headcount ratio.
1.6 Raw Headcounts
The table shows the proportion of people who experience deprivations in each of the indicators. The table sorts 101 countries from low to high according to the Multidimensional Poverty Index (MPI).
1.7 Sample size and Non-Response
The table reports the sample sizes from each survey that were used to compute the MPI. Reductions in sample sizes were due to missing data, as detailed in the columns by indicator. The table sorts 101 countries from low to high according to the Multidimensional Poverty Index (MPI).
From these 7 sheets the first 4 will be data of our focus. I will import and clean only the first 4 sheets.
Importing 1.1 National MPI Results.
Before writing code for the data I like to open in it excel just to get the feel. One can do this from RStudi as well, however, I prefer looking at the data once in excel id possible. Here is how the data looks.
1.1 National MPI Results in Excel
As one can see that in order to improt the data in a clean manner we will need to skip some inintial rows. Moreover, we will have to think of how to deal with merged cells in the header or column name.
One way of dealing with this is skipping first 9 rows and naming the columns in the next step. This sounds tedious though. The tediousness would become more amplified as the number of columns increase. What to do?
The other way is to skip only the first 4 rows and see how the column names are read. Let us try this first and see how this goes.
## New names:
## * `` -> ...6
## * `` -> ...8
## * `` -> ...9
## * `` -> ...10
## * `` -> ...11
## * ... and 6 more problems
## [1] "ISO\r\ncountry numeric code" "ISO\r\ncountry code"
## [3] "Country" "World region"
## [5] "MPI data source" "...6"
## [7] "Multidimensional poverty" "...8"
## [9] "...9" "...10"
## [11] "...11" "...12"
## [13] "Total population<U+1D43>" "...14"
## [15] "...15" "Number of MPI poor people<U+1D47>"
## [17] "...17" "...18"
## [19] "Indicators included in the MPI" "...20"
Above shown are the names of the columns that are read by default. Had we skipped all the 9 rows we would have to name all the columns. By skipping 4 rows instead of 9 we will not need to do that for all columns. Madhyam Marg!?
## New names:
## * `` -> ...6
## * `` -> ...8
## * `` -> ...9
## * `` -> ...10
## * `` -> ...11
## * ... and 6 more problems
dat_table1 %>%
rename(
"MPI data source survey" =`MPI data source`,
"MPI data source survey year" = ...6,
"MPI" = `Multidimensional poverty`,
"Headcount ratio" = ...8,
"Intensity" = ...9,
"vulnerable to poverty" = ...10,
"in severe povertty" = ...11,
"Inequality among the poor" = ...12,
#"Population in survey given year" = `Total populationᵃ`,
"Population 2016" = ...14,
"Population 2017" = ...15,
#"Number of MPI poor in given year" = `Number of MPI poor peopleᵇ`,
"Number of MPI poor in 2016" = ...17,
"Number of MPI poor in 2017" = ...18,
"Total number of indicators included" = `Indicators included in the MPI`,
"Number of missing indicators" = ...20
) -> dat_table1
# \Uxxx sequences are not allowed inside backticks. We will change those two columns with a different approach
names(dat_table1)## [1] "ISO\r\ncountry numeric code" "ISO\r\ncountry code"
## [3] "Country" "World region"
## [5] "MPI data source survey" "MPI data source survey year"
## [7] "MPI" "Headcount ratio"
## [9] "Intensity" "vulnerable to poverty"
## [11] "in severe povertty" "Inequality among the poor"
## [13] "Total population<U+1D43>" "Population 2016"
## [15] "Population 2017" "Number of MPI poor people<U+1D47>"
## [17] "Number of MPI poor in 2016" "Number of MPI poor in 2017"
## [19] "Total number of indicators included" "Number of missing indicators"
names(dat_table1) -> names_col_dat_tab1
names_col_dat_tab1[13] <- "Population as per survey in given year"
names_col_dat_tab1[16] <- "Number of MPI Poor in given year"
colnames(x = dat_table1) <- names_col_dat_tab1
names(dat_table1)## [1] "ISO\r\ncountry numeric code"
## [2] "ISO\r\ncountry code"
## [3] "Country"
## [4] "World region"
## [5] "MPI data source survey"
## [6] "MPI data source survey year"
## [7] "MPI"
## [8] "Headcount ratio"
## [9] "Intensity"
## [10] "vulnerable to poverty"
## [11] "in severe povertty"
## [12] "Inequality among the poor"
## [13] "Population as per survey in given year"
## [14] "Population 2016"
## [15] "Population 2017"
## [16] "Number of MPI Poor in given year"
## [17] "Number of MPI poor in 2016"
## [18] "Number of MPI poor in 2017"
## [19] "Total number of indicators included"
## [20] "Number of missing indicators"
Above are the names that we want. However, the way the column names are written (case) is not desirable. This reminds me of a wonderful aRtwork that Allison Horst made. Here it is.
This is so cool
The janitor package can help us get the column names in a desirable case.
## [1] "iso_country_numeric_code"
## [2] "iso_country_code"
## [3] "country"
## [4] "world_region"
## [5] "mpi_data_source_survey"
## [6] "mpi_data_source_survey_year"
## [7] "mpi"
## [8] "headcount_ratio"
## [9] "intensity"
## [10] "vulnerable_to_poverty"
## [11] "in_severe_povertty"
## [12] "inequality_among_the_poor"
## [13] "population_as_per_survey_in_given_year"
## [14] "population_2016"
## [15] "population_2017"
## [16] "number_of_mpi_poor_in_given_year"
## [17] "number_of_mpi_poor_in_2016"
## [18] "number_of_mpi_poor_in_2017"
## [19] "total_number_of_indicators_included"
## [20] "number_of_missing_indicators"
Now we will write a file of that has relevant columns.
Here is how the cleaned table1.1 of workbook Table1 Looks.
Cleaned Table1.1 of Table1